10 (a) Let us use the Boston dataset



In [1]:

    
library(MASS)
head(Boston)
# ?Boston for information









    





     crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat medv
1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98 24.0
2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14 21.6
3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03 34.7
4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94 33.4
5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33 36.2
6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21 28.7

How many rows and columns are in this dataset?



In [2]:

    
dim(Boston)

The Boston dataset has 506 rows and 14 columns (fields). Each row represents a suburb of Boston. Thirteen of the columns are properties of the suburb that may help explain house prices, and the remaining column, medv (the median home value), is the response variable.

10 (b) Let us create some pairwise scatter plots



In [22]:

    
pairs(Boston)

Since there are 14 variables, the scatter plot matrix becomes nearly illegible. Instead, we will get a bird's-eye view of our data using a correlation matrix.
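
One way to keep the scatter plot matrix readable is to restrict it to a handful of columns. The sketch below assumes an arbitrary subset chosen purely for illustration; `vars` is a hypothetical name that is not used elsewhere in this notebook.

```r
library(MASS)

# Hypothetical subset of columns, picked only to keep the plot small
vars <- c("crim", "rad", "tax", "lstat", "medv")

# Scatter plot matrix of the chosen columns only
pairs(Boston[, vars], pch = 20, cex = 0.5)
```

Any other subset of column names from `names(Boston)` could be swapped in.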



In [12]:

    
corr_matrix = cor(Boston, method="pearson") # Generate Correlation Matrix
corr_matrix









    





crim zn indus chas nox rm age dis rad tax ptratio black lstat medv

	crim  1.00000000 -0.20046922  0.40658341 -0.055891582  0.42097171 -0.21924670  0.35273425 -0.37967009  0.625505145  0.58276431  0.2899456  -0.38506394  0.4556215  -0.3883046  
	zn -0.20046922  1.00000000 -0.53382819 -0.042696719 -0.51660371  0.31199059 -0.56953734  0.66440822 -0.311947826 -0.31456332 -0.3916785   0.17552032 -0.4129946   0.3604453  
	indus  0.40658341 -0.53382819  1.00000000  0.062938027  0.76365145 -0.39167585  0.64477851 -0.70802699  0.595129275  0.72076018  0.3832476  -0.35697654  0.6037997  -0.4837252  
	chas -0.05589158 -0.04269672  0.06293803  1.000000000  0.09120281  0.09125123  0.08651777 -0.09917578 -0.007368241 -0.03558652 -0.1215152   0.04878848 -0.0539293   0.1752602  
	nox  0.42097171 -0.51660371  0.76365145  0.091202807  1.00000000 -0.30218819  0.73147010 -0.76923011  0.611440563  0.66802320  0.1889327  -0.38005064  0.5908789  -0.4273208  
	rm -0.21924670  0.31199059 -0.39167585  0.091251225 -0.30218819  1.00000000 -0.24026493  0.20524621 -0.209846668 -0.29204783 -0.3555015   0.12806864 -0.6138083   0.6953599  
	age  0.35273425 -0.56953734  0.64477851  0.086517774  0.73147010 -0.24026493  1.00000000 -0.74788054  0.456022452  0.50645559  0.2615150  -0.27353398  0.6023385  -0.3769546  
	dis -0.37967009  0.66440822 -0.70802699 -0.099175780 -0.76923011  0.20524621 -0.74788054  1.00000000 -0.494587930 -0.53443158 -0.2324705   0.29151167 -0.4969958   0.2499287  
	rad  0.62550515 -0.31194783  0.59512927 -0.007368241  0.61144056 -0.20984667  0.45602245 -0.49458793  1.000000000  0.91022819  0.4647412  -0.44441282  0.4886763  -0.3816262  
	tax  0.58276431 -0.31456332  0.72076018 -0.035586518  0.66802320 -0.29204783  0.50645559 -0.53443158  0.910228189  1.00000000  0.4608530  -0.44180801  0.5439934  -0.4685359  
	ptratio  0.28994558 -0.39167855  0.38324756 -0.121515174  0.18893268 -0.35550149  0.26151501 -0.23247054  0.464741179  0.46085304  1.0000000  -0.17738330  0.3740443  -0.5077867  
	black -0.38506394  0.17552032 -0.35697654  0.048788485 -0.38005064  0.12806864 -0.27353398  0.29151167 -0.444412816 -0.44180801 -0.1773833   1.00000000 -0.3660869   0.3334608  
	lstat  0.45562148 -0.41299457  0.60379972 -0.053929298  0.59087892 -0.61380827  0.60233853 -0.49699583  0.488676335  0.54399341  0.3740443  -0.36608690  1.0000000  -0.7376627  
	medv -0.38830461  0.36044534 -0.48372516  0.175260177 -0.42732077  0.69535995 -0.37695457  0.24992873 -0.381626231 -0.46853593 -0.5077867   0.33346082 -0.7376627   1.0000000

The following observations were made:

crim is positively correlated with indus, age, nox, rad, tax, and lstat, and negatively correlated with dis, black, and medv.
zn is positively correlated with dis and negatively correlated with indus, nox, age, and lstat.
indus is positively correlated with nox, age, rad, tax, and lstat, and negatively correlated with dis and medv.
nox is positively correlated with age, rad, tax, and lstat, and negatively correlated with dis and medv.
rm is positively correlated with medv and negatively correlated with lstat.
age is positively correlated with tax and lstat, and negatively correlated with dis.
dis is negatively correlated with tax, lstat, and rad.
rad is positively correlated with tax, ptratio, and lstat, and negatively correlated with black.
tax is positively correlated with ptratio and lstat, and negatively correlated with black and medv.
ptratio is negatively correlated with medv.
lstat is negatively correlated with medv.
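
Rather than scanning the matrix by eye, the stronger pairs can be pulled out programmatically. This is a minimal sketch, assuming a cut-off of 0.6 on the absolute correlation; the cut-off and the names `threshold`, `strong`, and `strong_pairs` are illustrative choices, not part of the original analysis.

```r
library(MASS)

corr_matrix <- cor(Boston, method = "pearson")

# Keep each pair once by looking only at the upper triangle
threshold <- 0.6  # illustrative cut-off, an assumption
strong <- which(abs(corr_matrix) > threshold & upper.tri(corr_matrix),
                arr.ind = TRUE)

strong_pairs <- data.frame(
  var1 = rownames(corr_matrix)[strong[, "row"]],
  var2 = colnames(corr_matrix)[strong[, "col"]],
  r    = corr_matrix[strong]
)

# Strongest relationships first
strong_pairs[order(-abs(strong_pairs$r)), ]
```

Lowering `threshold` would bring in the weaker relationships listed above.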

10 (c) From the correlation matrix and the scatter plots, we can make the following observations about how the crime rate (crim) tends to vary with the other fields.

The greater the proportion of non-retail business acres per town (indus), the greater the crime rate.
The older the houses (age), the greater the crime rate.
The greater the nitrogen oxide concentration (nox), the greater the crime rate.
The greater the index of accessibility to radial highways (rad), the greater the crime rate.
The higher the tax rate (tax), the greater the crime rate.
The higher the percentage of lower-status population (lstat), the higher the crime rate.
The further away the employment centers (dis), the lower the crime rate.
The higher the value of black, the lower the crime rate. Note that black is a transformed quantity, 1000(Bk - 0.63)^2 where Bk is the proportion of Black residents by town, so it does not rise monotonically with that proportion.
The higher the median home value (medv), the lower the crime rate.
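
The row of the correlation matrix for crim can also be sorted to rank these associations. A small sketch, with `crim_cor` as a hypothetical variable name:

```r
library(MASS)

# Correlation of every other variable with crim
crim_cor <- cor(Boston)["crim", ]
crim_cor <- crim_cor[names(crim_cor) != "crim"]  # drop the self-correlation

# Strongest positive association first, strongest negative last
sort(crim_cor, decreasing = TRUE)
```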

10 (d) The names of the suburbs are not given in the dataset, so finding suburbs with high crime rates, tax rates, or pupil-teacher ratios can only be done in relative terms, for example with a histogram. Let us determine the distribution of crime rates for the suburbs in our dataset.



In [24]:

    
hist(Boston$crim, breaks=20, xlab="Crime Rate", main="Histogram of Crime Rates")

It is clear that most of the suburbs have low crime rates. Now let's take a look at tax rates.



In [25]:

    
hist(Boston$tax, breaks=20, xlab="Tax Rate", main="Histogram of Tax Rates")

Many suburbs have tax rates up to about 440, and another large group has tax rates around 680. Few suburbs fall between these figures. Let us now take a look at pupil-teacher ratios.



In [26]:

    
hist(Boston$ptratio, breaks=20, xlab="Pupil Teacher Ratio", main="Histogram of Pupil Teacher Ratios")

The histogram for pupil-teacher ratios is fairly spread out, but with a particularly tall bar at around 20 to 20.5. Let us find out exactly how many such suburbs exist.



In [27]:

    
length(Boston$ptratio[20 < Boston$ptratio & Boston$ptratio < 20.5])

So 145 of the 506 suburbs in our dataset have a pupil-teacher ratio between 20 and 20.5. Pretty interesting!
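
To see which exact ratios make up that bar, the values inside the band can be tabulated. This is a sketch under the same strict bounds used above; `in_band` is a hypothetical name.

```r
library(MASS)

# Count suburbs strictly between 20 and 20.5 by summing a logical vector
in_band <- Boston$ptratio > 20 & Boston$ptratio < 20.5
sum(in_band)

# Tabulate the distinct ratios inside that band
table(Boston$ptratio[in_band])
```

Summing a logical vector gives the same count as taking the length of the filtered vector, and `table()` shows whether the count is spread over several ratios or concentrated on one.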

10 (e) Let us determine the number of suburbs that bound the Charles River.



In [28]:

    
length(Boston$chas[Boston$chas == 1])

So 35 suburbs bound the Charles River.

10 (f) What is the median pupil-teacher ratio for the towns in this dataset?



In [29]:

    
median(Boston$ptratio)

The median pupil-teacher ratio across the towns in this dataset is 19.05.

10 (g) Let us now find which suburb of Boston has the lowest median value of owner-occupied homes.



In [30]:

    
index = which.min(Boston$medv)  # Index of the first row with the minimum medv
Boston[index, ]                 # Show that row









    





crim zn indus chas nox rm age dis rad tax ptratio black lstat medv

	399 38.3518 0      18.1   0      0.693  5.453  100    1.4896 24     666    20.2   396.9  30.59  5

So the 399th suburb in the dataset has the lowest median value for owner-occupied homes (medv = 5). Let us see how its other fields compare with the rest of the dataset.
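
Because `which.min()` returns only the first position of the minimum, it is worth checking for ties. A minimal sketch, with `min_rows` as a hypothetical name:

```r
library(MASS)

# which() with an equality test returns every row that attains the minimum,
# whereas which.min() stops at the first one
min_rows <- which(Boston$medv == min(Boston$medv))
Boston[min_rows, ]
```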



In [31]:

    
percentile = ecdf(Boston$crim) #ecdf takes a vector and returns function for computing percentile.
print(paste("Crime Rate = ", percentile(Boston[index,'crim'])))#We can now compute the "percentile" of a value









    



[1] "Crime Rate =  0.988142292490119"

Let us quickly iterate over all fields to get the big picture.



In [32]:

    
fields = names(Boston)
for (field in fields) {
    percentile = ecdf(Boston[[field]])  # Empirical CDF of this column
    # Evaluate the ECDF at this suburb's own value for the field
    print(paste(field, " = ", percentile(Boston[index, field])))
}

Each printed value is the fraction of suburbs whose value for that field is at or below this suburb's value. Several of this suburb's values are extreme relative to the dataset. Reading its row against the quartiles of the full dataset, the suburb with the lowest median value of owner-occupied homes also has:

One of the highest crime rates (crim = 38.35, at about the 98.8th percentile)
No residential land zoned for lots (zn = 0)
A high proportion of non-retail business acres per town (indus = 18.1, equal to the third quartile)
A high nitrogen oxide concentration (nox = 0.693, above the third quartile of 0.624)
Fewer rooms per dwelling than most suburbs (rm = 5.453, below the first quartile of 5.886)
The oldest housing stock (age = 100, the maximum)
A short distance to employment centers (dis = 1.4896, below the first quartile of 2.100)
The highest index of accessibility to radial highways (rad = 24, the maximum)
A high tax rate (tax = 666, equal to the third quartile)
A high pupil-teacher ratio (ptratio = 20.2, equal to the third quartile)
The maximum value of the black variable (black = 396.9)
One of the highest percentages of lower-status population (lstat = 30.59, well above the third quartile of 16.95)
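
The same per-field percentiles can be collected into a single named vector instead of being printed one at a time. This is a sketch, assuming `index` is the row of the suburb with the lowest medv as above; `pct` is a hypothetical name.

```r
library(MASS)

index <- which.min(Boston$medv)

# For each column, build its empirical CDF and evaluate it at the
# selected suburb's own value in that column
pct <- sapply(names(Boston), function(field) {
  ecdf(Boston[[field]])(Boston[index, field])
})
round(pct, 3)
```

A value near 1 means the suburb sits at the top of that field's distribution, and a value near 0 means it sits at the bottom.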

10 (h) Let us find the number of suburbs that average over 7 rooms per dwelling.



In [34]:

    
length(Boston$rm[Boston$rm > 7])

So 64 suburbs average more than 7 rooms per dwelling. Now let us see how many suburbs exceed 8.



In [35]:

    
length(Boston$rm[Boston$rm > 8])

13 suburbs in our dataset average more than 8 rooms per dwelling. Let us see what kind of suburbs these are.



In [36]:

    
Boston[Boston$rm > 8,]









    





crim zn indus chas nox rm age dis rad tax ptratio black lstat medv

	98 0.12083  0      2.89  0      0.4450 8.069  76.0   3.4952  2     276    18.0   396.90 4.21   38.7   
	164 1.51902  0     19.58  1      0.6050 8.375  93.9   2.1620  5     403    14.7   388.45 3.32   50.0   
	205 0.02009 95      2.68  0      0.4161 8.034  31.9   5.1180  4     224    14.7   390.55 2.88   50.0   
	225 0.31533  0      6.20  0      0.5040 8.266  78.3   2.8944  8     307    17.4   385.05 4.14   44.8   
	226 0.52693  0      6.20  0      0.5040 8.725  83.0   2.8944  8     307    17.4   382.00 4.63   50.0   
	227 0.38214  0      6.20  0      0.5040 8.040  86.5   3.2157  8     307    17.4   387.38 3.13   37.6   
	233 0.57529  0      6.20  0      0.5070 8.337  73.3   3.8384  8     307    17.4   385.91 2.47   41.7   
	234 0.33147  0      6.20  0      0.5070 8.247  70.4   3.6519  8     307    17.4   378.95 3.95   48.3   
	254 0.36894 22      5.86  0      0.4310 8.259   8.4   8.9067  7     330    19.1   396.90 3.54   42.8   
	258 0.61154 20      3.97  0      0.6470 8.704  86.9   1.8010  5     264    13.0   389.70 5.12   50.0   
	263 0.52014 20      3.97  0      0.6470 8.398  91.5   2.2885  5     264    13.0   386.86 5.91   48.8   
	268 0.57834 20      3.97  0      0.5750 8.297  67.0   2.4216  5     264    13.0   384.54 7.44   50.0   
	365 3.47428  0     18.10  1      0.7180 8.780  82.9   1.9047 24     666    20.2   354.55 5.29   21.9



In [37]:

    
summary( Boston[Boston$rm > 8,] )









    





      crim               zn            indus             chas       
 Min.   :0.02009   Min.   : 0.00   Min.   : 2.680   Min.   :0.0000  
 1st Qu.:0.33147   1st Qu.: 0.00   1st Qu.: 3.970   1st Qu.:0.0000  
 Median :0.52014   Median : 0.00   Median : 6.200   Median :0.0000  
 Mean   :0.71879   Mean   :13.62   Mean   : 7.078   Mean   :0.1538  
 3rd Qu.:0.57834   3rd Qu.:20.00   3rd Qu.: 6.200   3rd Qu.:0.0000  
 Max.   :3.47428   Max.   :95.00   Max.   :19.580   Max.   :1.0000  
      nox               rm             age             dis       
 Min.   :0.4161   Min.   :8.034   Min.   : 8.40   Min.   :1.801  
 1st Qu.:0.5040   1st Qu.:8.247   1st Qu.:70.40   1st Qu.:2.288  
 Median :0.5070   Median :8.297   Median :78.30   Median :2.894  
 Mean   :0.5392   Mean   :8.349   Mean   :71.54   Mean   :3.430  
 3rd Qu.:0.6050   3rd Qu.:8.398   3rd Qu.:86.50   3rd Qu.:3.652  
 Max.   :0.7180   Max.   :8.780   Max.   :93.90   Max.   :8.907  
      rad              tax           ptratio          black      
 Min.   : 2.000   Min.   :224.0   Min.   :13.00   Min.   :354.6  
 1st Qu.: 5.000   1st Qu.:264.0   1st Qu.:14.70   1st Qu.:384.5  
 Median : 7.000   Median :307.0   Median :17.40   Median :386.9  
 Mean   : 7.462   Mean   :325.1   Mean   :16.36   Mean   :385.2  
 3rd Qu.: 8.000   3rd Qu.:307.0   3rd Qu.:17.40   3rd Qu.:389.7  
 Max.   :24.000   Max.   :666.0   Max.   :20.20   Max.   :396.9  
     lstat           medv     
 Min.   :2.47   Min.   :21.9  
 1st Qu.:3.32   1st Qu.:41.7  
 Median :4.14   Median :48.3  
 Mean   :4.31   Mean   :44.2  
 3rd Qu.:5.12   3rd Qu.:50.0  
 Max.   :7.44   Max.   :50.0

Let us try to compare these stats to those of the entire dataset.



In [38]:

    
summary(Boston)









    





      crim                zn             indus            chas        
 Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
 1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
 Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
 Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
 3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
 Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
      nox               rm             age              dis        
 Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
 1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
 Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
 Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
 3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
 Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
      rad              tax           ptratio          black       
 Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
 1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
 Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
 Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
 3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
 Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
     lstat            medv      
 Min.   : 1.73   Min.   : 5.00  
 1st Qu.: 6.95   1st Qu.:17.02  
 Median :11.36   Median :21.20  
 Mean   :12.65   Mean   :22.53  
 3rd Qu.:16.95   3rd Qu.:25.00  
 Max.   :37.97   Max.   :50.00

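Some noticeable factors for the suburbs with over 8 rooms per dwelling on average:

The crime rate (crim) spans a much narrower and lower range (maximum 3.47, against 88.98 for the full dataset)
The percentage of lower-status population (lstat) is much lower (maximum 7.44, against 37.97 for the full dataset)
The median home value (medv) is much higher (median 48.3, against 21.2 for the full dataset)
Most of the other fields seem broadly similar to those of the full dataset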
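
Putting the medians of the two groups side by side makes the comparison easier than reading two separate summaries. A minimal sketch; `big` and `comparison` are hypothetical names.

```r
library(MASS)

big <- Boston[Boston$rm > 8, ]  # suburbs averaging more than 8 rooms

# Side-by-side medians: subset versus the full dataset
comparison <- rbind(
  over_8_rooms = sapply(big, median),
  all_suburbs  = sapply(Boston, median)
)
round(comparison, 2)
```

With only 13 suburbs in the subset, the median is a more robust summary than the mean here.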
